Abstract

We consider the problem of private computation of approximate Heavy Hitters. Alice and 
Bob each hold a vector and, in the vector sum, they want to find the B largest values along 
with their indices. While the exact problem requires linear communication, protocols in the 
literature solve this problem approximately using polynomial computation time, polylogarithmic 
communication, and constantly many rounds. We show how to solve the problem privately with 
comparable cost, in the sense that nothing is learned by Alice and Bob beyond what is implied by 
their input, the ideal top-B output, and goodness of approximation (equivalently, the Euclidean
norm of the vector sum). We give lower bounds showing that the Euclidean norm must be leaked by
any efficient algorithm.
1 Introduction



Secure and private multiparty computation has been studied for several decades, starting with |2I3 
Any protocol for computing a function of several inputs can be converted, gate-by-gate, to a 
private protocol, in which no party learns anything from the protocol messages other than what 
can be deduced from the function's input/output relation. The computational overhead is at most 
polynomial in the size of the inputs. 

In recent years, however, input sizes in many problems have grown to the point where "poly- 
nomial computational overhead" is too coarse a measure; both computation and communication 
should be minimized. For example, absent privacy concerns, applications may require that a protocol
uses at most polylogarithmic communication. General-purpose secure multiparty computation
may blow up communication exponentially, so additional techniques are needed. In one theoretical 
approach, individual protocols are designed for functions of interest such as database lookup (the 
private information retrieval problem [SIEIEI) and building decision trees |^. Another approach, 
the breakthrough ^3], converts any protocol into a private one with little communication blowup, 
but imposes a computational blowup that may be exponential. 

The approach we follow, which was introduced in [10], is to substitute an approximate function
for the desired function. Many functions of interest have good approximations that can be computed
efficiently both in terms of computation and communication. A caveat is that the traditional
definition of privacy is no longer appropriate. Instead, a protocol $\pi$ computing an approximation
$\hat f$ to a function $f$ is a private approximation protocol [10] for $f$ if

• $\pi$ is a private protocol for $\hat f$ in the traditional sense that the messages of $\pi$ leak nothing
beyond what is implied by inputs and $\hat f$, and,

• the output $\hat f$ leaks nothing beyond what is implied by inputs and $f$.

Several examples were given in [10]. Another important example, crucial to this article and the
first non-trivial example to achieve polylogarithmic communication and polynomial computation,
was given in [14]. There, Alice and Bob have vectors $a$ and $b$ of length $N$, taking integer values in
the range $[-M, M]$. Their goal is to approximate the Euclidean norm of the sum, $\|a+b\|_2^2$. The
authors show how to compute an estimate $\|a+b\|_{\sim}^2$ such that, if $k$ is a security parameter,

• $\frac{1}{1+\epsilon}\|a+b\|_2^2 \le \|a+b\|_{\sim}^2 \le \|a+b\|_2^2$.

• The protocol requires $\mathrm{poly}(k \log(M) N/\epsilon)$ local computation, $\mathrm{poly}(k \log(M) \log(N)/\epsilon)$
communication, and $O(1)$ rounds.

• No party learns more from the protocol messages than can be deduced from the approximate
output $\|a+b\|_{\sim}^2$ and the relevant party's input, and no party learns more from the output
$\|a+b\|_{\sim}^2$ than can be deduced from the exact output $\|a+b\|_2^2$.

We will make use of this result. 
1.1 Our Results 

Each of two parties has a vector, $a$ and $b$, and they want a summary for the vector sum $c = a + b$.
First, we consider the Euclidean approximate heavy hitters problem, in which there is a parameter,
$B$, and the players ideally want $c_{\mathrm{opt}}$, the $B$ largest terms in $c$, i.e., the $B$ biggest values together
with the corresponding indices. Unfortunately, finding $c_{\mathrm{opt}}$ exactly requires linear communication.
Instead, the players use polylogarithmic communication (and polynomial work and $O(1)$ rounds)
to output a vector $\hat c$ with $\|\hat c - c\|_2^2 \le (1+\epsilon)\|c_{\mathrm{opt}} - c\|_2^2$. In our protocol, the players learn nothing
more than what can be deduced from $c_{\mathrm{opt}}$ and $\|c\|_2$. (We discuss below the significance of leaking
$\|c\|_2$.) We can immediately use this result as a black box for approximate sparse representations over
any orthonormal basis such as wavelet or Fourier, with similar costs. We can also use the result as
a black box for taxicab approximate heavy hitters, i.e., finding $\hat c$ with $\|\hat c - c\|_1 \le (1+\epsilon)\|c_{\mathrm{opt}} - c\|_1$,
leaking $c_{\mathrm{opt}}$ and $\|c\|_2$.

In the basic result, we give an at-most-$B$-term representation that is nearly as good (in the
Euclidean sense) as the best $B$-term representation and leaks no more than the best $B$-term
representation and the Euclidean norm. Leaking the Euclidean norm represents a weaker result than not
leaking the Euclidean norm, but (i) leaking $\|c\|_2$ is necessary in some circumstances and (ii) computing
or approximating $\|c\|_2$ is desirable in some circumstances. First, we give a straightforward lower
bound showing that, for some (reasonable) values of parameters $M, N, \ldots$, computing $\hat c$ leaking only
$c_{\mathrm{opt}}$ requires $\Omega(N)$ communication. In fact, for some (artificial) classes of inputs, $\Omega(N)$ communication
is needed unless $\|c\|_2$ itself is not only potentially leaked, but actually computed exactly. On the
other hand, one can regard the Euclidean norm as semantically interesting, so that we can regard
the top $B$ terms together with the Euclidean norm as a compound, extended summary. In particular,
since $\hat c$ is computed, leaking $\|c\|_2^2$ is equivalent to leaking $\|c\|_2^2 - \|\hat c\|_2^2 = \|\hat c - c\|_2^2$, i.e., the error
in our representation, which is a useful and common thing to want to compute. Our protocol indeed
can be modified to output an approximation $\|\hat c - c\|_{\sim}^2$ with $\|\hat c - c\|_2^2 \le \|\hat c - c\|_{\sim}^2 \le (1+\epsilon)\|\hat c - c\|_2^2$,
so we can regard the protocol as solving two cascaded approximation problems: find a near-best
representation $\hat c$, then find an approximation $\|\hat c - c\|_{\sim}^2$ to $\|\hat c - c\|_2^2$. It is natural to expect a protocol
for $\hat c$ to leak $c_{\mathrm{opt}}$ and a protocol for $\|\hat c - c\|_{\sim}^2$ to leak $\|\hat c - c\|_2^2$; while lower bounds prevent that,
we can compute $\hat c$ and $\|\hat c - c\|_{\sim}^2$ simultaneously and guarantee that, overall, we leak only $c_{\mathrm{opt}}$ and
$\|c\|_2$.

We give a result for taxicab heavy hitters that produces an at-most-$B$-term representation
that is nearly as good (in the taxicab sense) as the best $B$-term representation and leaks no
more than the best $B$-term representation and the Euclidean norm. Thus we have shown that
the private Euclidean norm approximation can be used for non-Euclidean problems. Finally, we
also give a result for other orthonormal bases that involves little additional algorithmic or privacy
work, but demonstrates that the basic result can be applied in a variety of interesting applications.
It says that we provide an at-most-$B$-term Fourier representation that is almost as good (in the
Euclidean sense) as the best $B$-term Fourier representation and leaks no more than the best $B$-term
representation and the Euclidean norm. The Fourier basis may be replaced by any orthonormal
basis, such as Hadamard or wavelet.

1.2 Related Work 

Other work in private communication-efficient protocols for specific functions includes the Private 
Information Retrieval problem 1171 , building decision trees flSj , set intersection and matching,
and $k$'th-ranked element.

The breakthrough gives a general technique for converting any protocol into a private 
protocol with little communication overhead. It is not the end of the story, however, because the 
computation may increase exponentially. 

Work on private approximations includes [10], which introduced the notion in a conference paper
in 2001 and gave several protocols. Some negative results were given in for approximations to
NP-hard functions; more on NP-hard search problems appears in . Recently, [14] gives a private
approximation to the Euclidean norm that is central to our paper.

Statistical work such as also addresses approximate summaries over large databases, but 
differs from our work in many parameters, such as the number of players and the allowable com- 
munication. 

There are many papers that address the Heavy Hitters problem and sketching in general, in 
a variety of contexts. Many of the needed ideas can be seen in [2] and other important papers 
include [11110. 

1.3 Organization 

This paper is organized as follows. In Section 2 we give preliminaries. In Section 3 we present our
main result. In a later section we present lower bounds.

2 Preliminaries 

2.1 Parameters and Notation 

Fix parameters $N, M, B, k, \epsilon$. We will consider two players, Alice and Bob, who will have inputs,
$a$ and $b$ respectively, that are vectors of length $N$ taking integer values in the range $-M$ to $+M$.
Throughout, we will be interested in summaries of size $B$ for the vector $c = a + b$. For example,
in the main result, we are interested ideally in the largest $B$ terms of $c$. A vector $c$ is written
$c = (c_0, c_1, c_2, \ldots, c_{N-1}) = \sum_{0 \le j < N} c_j \delta_j$, where $j$ is an index, $c_j$ is a value, $\delta_j$ is the vector that is 1 at
index $j$ and 0 elsewhere, and $c_j \delta_j$, which can be implemented compactly and equivalently written
as the pair $(j, c_j)$, is a term, in which $c_j$ is the coefficient.

We compare terms by the magnitudes of their coefficients, breaking ties by the indices. That
is, we will say that $(j, c_j) < (k, c_k)$ if $|c_j| < |c_k|$ or both $|c_j| = |c_k|$ and $j < k$. Thus all terms are
strictly comparable. A heavy hitter summary is an expression of the form $\sum_{i \in \Lambda} \eta_i \delta_i$. If $|\Lambda|$ must
be at most $B$, then the best heavy hitter summary $c_{\mathrm{opt}}$ for a vector $c$ occurs where $\{(i, \eta_i) : i \in \Lambda\}$
consists of the $B$ largest terms.
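
As a concrete illustration of the term order and of $c_{\mathrm{opt}}$, the following minimal Python sketch sorts terms by coefficient magnitude, breaking ties by index. The names `term_key` and `best_summary` and the sample vectors are assumptions made for illustration only; the computation is done in the clear, with no privacy.

```python
from typing import List, Tuple

Term = Tuple[int, int]  # (index j, coefficient c_j)

def term_key(term: Term) -> Tuple[int, int]:
    """Sort key for the order in the text: by |c_j|, ties broken by index j."""
    j, cj = term
    return (abs(cj), j)

def best_summary(c: List[int], B: int) -> List[Term]:
    """Return the B largest terms of c (the summary c_opt), largest first."""
    terms = list(enumerate(c))
    return sorted(terms, key=term_key, reverse=True)[:B]

# Hypothetical inputs for Alice and Bob.
a = [3, -1, 0, 5]
b = [1, -4, 2, -5]
c = [x + y for x, y in zip(a, b)]  # the vector sum c = a + b
c_opt = best_summary(c, 2)         # the two largest terms of c
```

Because ties are broken by index, two terms with equal magnitude are never considered equal, which is what makes the "B largest terms" well defined.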



The support $\mathrm{supp}(a)$ of a vector $a$ is the set of indices where $a$ is non-zero, $\{i : a_i \ne 0\}$.

The parameter $\epsilon$ is a distortion parameter. We will guarantee summaries whose error is at most
the factor $(1 + \epsilon)$ times the error of the best possible summary.

The parameter $k$ is a security and failure probability parameter. Algorithms will be expected
to succeed except with probability $2^{-k}$, and $2^{-k}$ will serve as an upper bound for the allowable
statistical distance between indistinguishable distributions.

We will be interested in protocols that use communication $\mathrm{poly}(B, \log(N), k, \log(M), 1/\epsilon)$, local
computation $\mathrm{poly}(B, N, k, \log(M), 1/\epsilon)$, and number of rounds that is constant.

2.2 Approximate Data Summaries 

In the heavy hitters problem, we are given parameters $B$ and $N$ and the goal is to find the $B$ largest
terms in a vector $c$ of length $N$. We will be interested in two approximate versions, parametrized
also by $\epsilon$. In the approximate heavy hitters problem, we want a summary $\hat c = \sum_{i \in \Lambda} \eta_i \delta_i$ such that
$\|\hat c - c\| \le (1 + \epsilon)\|c_{\mathrm{opt}} - c\|$, where the norms are, respectively, 2-norms (in the Euclidean
approximate heavy hitters problem) and 1-norms (in the taxicab approximate heavy hitters problem).
The Euclidean norm is $\|c\|_2 = (\sum_j c_j^2)^{1/2}$.

In order to describe previous algorithms that are relevant to us, we first need some definitions.
Fix a vector $c = (c_0, c_1, c_2, \ldots, c_{N-1}) = \sum_{0 \le i < N} c_i \delta_i$ whose terms are $t_0 = (0, c_0), t_1 =
(1, c_1), \ldots, t_{N-1} = (N-1, c_{N-1})$. Suppose the sequence $i'_0, i'_1, \ldots$ is a decreasing rearrangement of
$c$, i.e., $t_{i'_0} > t_{i'_1} > \cdots > t_{i'_{N-1}}$.

Definition 2.1 (Significant index.) Let $I \subseteq [0, N)$ be a set of indices containing $i$. Then $i$ is a
$(I, \theta)$-significant index for $c$ if and only if $c_i^2 \ge \theta \sum_{j \in I} |c_j|^2$.

That is, an index is significant if the corresponding value is large compared with all the values. In
some of the algorithms below, we will find the largest term (if it is sufficiently large), subtract it
off, then recurse on the residual signal. This motivates the following definitions.

Definition 2.2 (Significant index set.) Let $I \subseteq [0, N)$ be a set of $m$ indices. Then $I$ is
a $\theta$-significant index set for $c$ if and only if, for all $j = 0, \ldots, m-1$, $i'_j$ is a
$([0, N) \setminus \{i'_0, \ldots, i'_{j-1}\}, \theta)$-significant index.

That is, in a significant index set for $c$, the largest term has a significant index; after removing the
largest term, the new largest term has a significant index, etc. Note that there can be more than
one $\theta$-significant index set for a given vector.

Definition 2.3 (Qualified index set.) Fix parameters $\ell$ and $\theta$. The set $Q = \{i'_0, i'_1, \ldots, i'_{m-1}\}$ is a
$(\ell, \theta)$-qualified index set for $c$ if and only if

• $m \le \ell$,

• $\{i'_0, i'_1, \ldots, i'_{m-1}\}$ is a $\theta$-significant index set, and

• $\{i'_0, i'_1, \ldots, i'_{m-1}, i'_m\}$ is NOT a $\theta$-significant index set.

That is, a qualified index set consists of the largest possible length $m$ for a prefix of $i'_0, i'_1, \ldots$,
such that, for each $j < m$, we have $c_{i'_j}^2 \ge \theta(c_{i'_j}^2 + c_{i'_{j+1}}^2 + c_{i'_{j+2}}^2 + \cdots + c_{i'_{N-1}}^2)$. In particular,
if the terms happen to be in decreasing order to begin with, i.e., if $|c_0| \ge |c_1| \ge \cdots$, then a
qualified index set is $\{0, 1, 2, \ldots, m-1\}$ for the largest $m$ such that, for each $j < m$, we have
$c_j^2 \ge \theta(c_j^2 + c_{j+1}^2 + c_{j+2}^2 + \cdots)$.

Note that for each $\ell$, $\theta$, and vector $c$, there is only one $(\ell, \theta)$-qualified index set for $c$. We use
$Q_{c,\ell,\theta}$ to denote it. We sometimes write $Q_{\ell,\theta}$ when $c$ is understood.
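
The definition of the qualified index set can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not part of any cited algorithm: the function name `qualified_index_set` is hypothetical, the significance test is taken with a non-strict inequality, and the whole vector is assumed to be available in the clear.

```python
from typing import List

def qualified_index_set(c: List[int], ell: int, theta: float) -> List[int]:
    """Indices of the (ell, theta)-qualified index set of c, largest term first."""
    # Decreasing rearrangement: by magnitude, ties broken by index.
    order = sorted(range(len(c)), key=lambda j: (abs(c[j]), j), reverse=True)
    residual = sum(x * x for x in c)  # sum of |c_j|^2 over indices not yet removed
    Q: List[int] = []
    for i in order[:ell]:
        if c[i] * c[i] < theta * residual:
            break                     # i is not significant: the prefix ends here
        Q.append(i)
        residual -= c[i] * c[i]       # remove the term, continue on the residual
    return Q

c = [10, 3, -1, 1]
Q_small_theta = qualified_index_set(c, ell=3, theta=0.6)
Q_large_theta = qualified_index_set(c, ell=3, theta=0.95)
```

Raising the threshold can only shorten the prefix, which is the monotonicity used repeatedly in the analysis.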
The following are straightforward. 

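Proposition 2.4 For any $\theta_1 < \theta_2$, the set $Q_{\ell,\theta_2}$ is a subset of $Q_{\ell,\theta_1}$.

Proposition 2.5 Fix parameters $N, M, B, k, \epsilon$ and vector $c$ as above. If $\hat c = \sum_{i \in Q_{B,\epsilon/(B(1+\epsilon))}} c_i \delta_i$,
then $\|\hat c - c\|_2^2 \le (1 + \epsilon)\|c_{\mathrm{opt}} - c\|_2^2$.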
Proof: Assume without loss of generality that $|c_0| \ge |c_1| \ge \cdots$ and let $q = |Q_{B,\epsilon/(B(1+\epsilon))}|$. If $q = B$,
then $\hat c = c_{\mathrm{opt}}$ and we are done. Otherwise we have

$$\|\hat c - c\|_2^2 = \sum_{q \le i < B} |c_i|^2 + \|c_{\mathrm{opt}} - c\|_2^2 \le B|c_q|^2 + \|c_{\mathrm{opt}} - c\|_2^2 \le \frac{\epsilon}{1+\epsilon}\|\hat c - c\|_2^2 + \|c_{\mathrm{opt}} - c\|_2^2,$$

whence

$$\frac{1}{1+\epsilon}\|\hat c - c\|_2^2 \le \|c_{\mathrm{opt}} - c\|_2^2.$$

The result follows. ■
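
The inequality of Proposition 2.5 can be checked numerically. The sketch below is an illustration only: the helper name `errors`, the random test vectors, and the choice $\epsilon = 1/4$ are assumptions, and exact rational arithmetic is used so that the threshold comparison has no rounding error.

```python
from fractions import Fraction
import random

def errors(c, B, eps):
    """Return (||c_hat - c||^2, ||c_opt - c||^2), c_hat supported on Q_{B, eps/(B(1+eps))}."""
    theta = eps / (B * (1 + eps))
    order = sorted(range(len(c)), key=lambda j: (abs(c[j]), j), reverse=True)
    residual = sum(x * x for x in c)
    q = 0
    # Grow the qualified prefix while the next term is significant.
    while q < min(B, len(c)) and c[order[q]] ** 2 >= theta * residual:
        residual -= c[order[q]] ** 2
        q += 1
    err_opt = sum(c[i] ** 2 for i in order[B:])  # error of the best B-term summary
    return residual, err_opt

rng = random.Random(1)
eps = Fraction(1, 4)
all_ok = True
for _ in range(200):
    c = [rng.randint(-20, 20) for _ in range(12)]
    err_hat, err_opt = errors(c, B=4, eps=eps)
    all_ok = all_ok and err_hat <= (1 + eps) * err_opt
```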

The algorithms below will work from a linear sketch of a vector. 

Definition 2.6 (Sketch of a vector.) Given a vector c, a linear sketch of c is Rc, where R is a 
random matrix generated from a prescribed distribution, called the measurement matrix. 

In our case, as is typical, the matrix R will be a pseudorandom matrix that can be generated
from a short pseudorandom seed. We will use sketching for the norm_estimation protocol, in
which the generator needs to be secure against small space, and a different measurement matrix in
the non-private Euclidean Heavy Hitters protocol, where, e.g., pairwise independence suffices
for the pseudorandom number generator.
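
The property of a linear sketch that the protocol relies on is that each party can sketch its own vector locally, and the sketches add. The following minimal Python sketch shows this; `measurement_matrix` and `sketch` are hypothetical names, and Python's `random.Random` is only a stand-in for the pseudorandom generators discussed above (it is neither pairwise independent by construction nor secure against small space).

```python
import random

def measurement_matrix(seed, rows, n):
    """A +/-1 matrix derived from a shared seed (stand-in generator)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]

def sketch(R, v):
    """The linear sketch Rv, as a list of integers."""
    return [sum(r_j * v_j for r_j, v_j in zip(row, v)) for row in R]

R = measurement_matrix(seed=7, rows=5, n=4)  # both players derive R from the seed
a = [3, -1, 0, 5]
b = [1, -4, 2, -5]
c = [x + y for x, y in zip(a, b)]

sketch_a = sketch(R, a)  # computed locally by Alice
sketch_b = sketch(R, b)  # computed locally by Bob
sketch_c = [x + y for x, y in zip(sketch_a, sketch_b)]  # equals R(a + b)
```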

An algorithm in connection with the Euclidean approximate heavy hitter problem satisfying 
the following is known: 

Theorem 2.7 Fix parameters $N, M, B, k, \epsilon$ as above. Fix $\theta \ge \mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)^{-1}$.
There is a distribution on sketch matrices $R$ and a corresponding algorithm that, from $R$ and sketch
$Rc$ of a vector $c$, outputs a superset of $Q_{c,B,\theta}$, in time $\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$.

In particular, the number of rows in $R$ and the size of the output are bounded by the expression
$\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$ in accordance with the time bound on the algorithm.

Note that the algorithm returns a superset of $Q_{c,B,\theta}$ but that even $Q_{c,B,\theta}$ itself suffices for a
good approximation.

Proof: [sketch] One such algorithm is as follows. As in [12], one can estimate $c_i$ by $\hat c_i = \delta_i^T R^T R c$, to within
$\pm(\epsilon/B)\|c\|_2$ except with small probability, where $R$ is a $\pm 1$-valued matrix with $\mathrm{poly}(\log(N), 1/\epsilon)$
independent rows, each of which is a pairwise independent family. By repeating $O(k)$ times and
taking a median, one can drive down the failure probability to $2^{-k}$. As in ^2], one need not
estimate all the terms; rather, in time $\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$, one can find a set $I$ of indices
that includes all terms with magnitude at least $\theta\|c\|_2$ (and possibly other terms). By adjusting
parameters, one can estimate such $c_i$ well enough as $\hat c_i$ so that $|\hat c_i - c_i|^2 \le (\epsilon/B)\|c\|_2^2$. To get a
superset of a qualified set, subtract off the approximation to $c_i \delta_i$ and repeat as long as new $c_i$ (or
better approximations to old $c_i$) are found that are large compared with the residual vector. At
most $O(\log(MN))$ repetitions are needed since, after $O(\log(MN))$ repetitions, we have reduced
$\|c\|_2^2$ from its initial value of at most $M^2 N$ to its least possible positive value of 1. ■
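
The estimator in the proof sketch can be illustrated in a few lines of Python. This is a simplified sketch under stated assumptions: fully independent random signs from `random.Random` stand in for the pairwise independent families, the per-row estimates are averaged within a group before the median is taken, and the function name and parameters (`estimate_coefficient`, `rows_per_group`, `groups`) are hypothetical.

```python
import random
from statistics import median

def estimate_coefficient(c, i, rows_per_group, groups, seed=0):
    """Estimate c[i] from +/-1 measurements of c: mean within groups, median across."""
    rng = random.Random(seed)
    n = len(c)
    group_means = []
    for _ in range(groups):
        total = 0
        for _ in range(rows_per_group):
            row = [rng.choice((-1, 1)) for _ in range(n)]
            sketch_entry = sum(r * x for r, x in zip(row, c))  # one entry of Rc
            total += row[i] * sketch_entry  # equals c[i] plus signed cross terms
        group_means.append(total / rows_per_group)
    return median(group_means)

# One heavy coefficient among small ones (hypothetical data).
c = [1000] + [(-1) ** j for j in range(1, 50)]
estimate = estimate_coefficient(c, 0, rows_per_group=32, groups=9)
```

Each single-row estimate equals $c_i$ plus a sum of randomly signed copies of the other coefficients, so averaging reduces the error and the median over groups reduces the failure probability.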
2.3 Privacy

Secure multiparty computation allows two or more parties to evaluate a specified function of their 
inputs while hiding their inputs from each other. We work in the semi-honest model, which assumes 
that the adversary is passive and can't modify the behavior of corrupted parties. In particular, the 
computation is only concerned with the information learned by the adversary, and not with the 
effect misbehavior may have on the protocol's correctness. 

We briefly review private two-player protocols in the semi-honest model. A two-party computation
task is specified by a (possibly randomized) mapping $g$ from a pair of inputs $(a, b) \in
\{0, 1\}^* \times \{0, 1\}^*$ to a pair of outputs $(c, d) \in \{0, 1\}^* \times \{0, 1\}^*$. Let $\pi = (\pi_A, \pi_B)$ be a two-party
protocol computing $g$. Consider the probability space induced by the execution of $\pi$ on input $x = (a, b)$
(induced by the independent choices of random inputs $r_A, r_B$). Let $\mathrm{view}_A^\pi(x)$ (resp., $\mathrm{view}_B^\pi(x)$)
denote the entire view of Alice (resp., Bob) in this execution, including her input, random input,
and all messages she has received. Let $\mathrm{output}_A^\pi(x)$ (resp., $\mathrm{output}_B^\pi(x)$) denote Alice's (resp., Bob's)
output. Note that the above four random variables are defined over the same probability space.
Two distributions (or ensembles) $D_1$ and $D_2$ are said to be computationally indistinguishable with
security parameter $k$, written $D_1 \stackrel{c}{\equiv} D_2$, if, whenever $X_1 \sim D_1$ and $X_2 \sim D_2$, for any function $C$ having
a circuit of size at most $2^k$, we have $|\Pr(C(X_1) = 1) - \Pr(C(X_2) = 1)| \le 2^{-k}$.

Definition 2.8 Let $X$ be the set of all valid inputs $x = (a, b)$. A protocol $\pi$ is a private protocol
computing $g$ if the following properties hold:

Correctness. The joint outputs of the protocol are distributed according to $g(a, b)$. Formally,

$\{(\mathrm{output}_A^\pi(x), \mathrm{output}_B^\pi(x))\}_{x \in X} \equiv \{(g_A(x), g_B(x))\}_{x \in X}$,

where $(g_A(x), g_B(x))$ is the joint distribution of the outputs of $g(x)$.

Privacy. There exist probabilistic polynomial-time algorithms $S_A$, $S_B$, called simulators, such that:

$\{(S_A(a, g_A(x)), g_B(x))\}_{x=(a,b) \in X} \stackrel{c}{\equiv} \{(\mathrm{view}_A^\pi(x), \mathrm{output}_B^\pi(x))\}_{x \in X}$
$\{(g_A(x), S_B(b, g_B(x)))\}_{x=(a,b) \in X} \stackrel{c}{\equiv} \{(\mathrm{output}_A^\pi(x), \mathrm{view}_B^\pi(x))\}_{x \in X}$

There are efficient general techniques: 

Proposition 2.9 (General-Purpose Secure Multiparty Computation (SMC) \2U^ ) Two parties hold- 
ing inputs X and y can privately compute any circuit C with communication and computation 
$O(k(|C| + |x| + |y|))$, where $k$ is a security parameter, in $O(1)$ rounds.

Private approximation requires further discussion. 

Definition 2.10 (Private Approximation Protocol [10]) A two-party protocol $\pi$ is a private
approximation protocol for a deterministic, common-output function $g$ on inputs $a$ and $b$ if $\pi$
computes a (possibly randomized) approximation $\hat g$ to $g$ such that

• $\hat g$ is a good approximation to $g$ (in the appropriate sense)

• $\pi$ is a private protocol for $\hat g$ in the traditional sense.

• (Functional Privacy.) There exists a probabilistic polynomial-time simulator $S$ such that:

$\{S(g(x))\}_{x=(a,b) \in X} \stackrel{c}{\equiv} \{\hat g(x)\}_{x \in X}$.



In our case, $g(a, b)$ will formally be the pair $(c_{\mathrm{opt}}, \|c\|_2^2)$ and $\hat g(a, b)$ will be $\hat c$. We will informally say
that we "approximate $c_{\mathrm{opt}}$ leaking only $c_{\mathrm{opt}}$ and $\|c\|_2$," since there is a simulator that takes $c_{\mathrm{opt}}$ and
$\|c\|_2$ as input and simulates the approximate output $\hat c$ and the protocol messages. Equivalently, one
could define $g(a, b)$ to be the pair $(c_{\mathrm{opt}}, \|c_{\mathrm{opt}} - c\|)$ and define $\hat g(a, b)$ to be the pair $(\hat c, \|\hat c - c\|_{\sim})$,
where $\|\cdot\|_{\sim}$ is an approximation to the Euclidean norm (see below).

In our case of a deterministic function to be output to both Alice and Bob, a (weakly) equivalent
definition is as follows, known as the "liberal" definition in [10]:

Definition 2.11 A two-party protocol $\pi$ is a private approximation protocol for a deterministic,
common-output function $g$ on inputs $a$ and $b$ in the liberal sense if $\pi$ computes a (possibly
randomized) approximation $\hat g$ to $g$ such that

• $\hat g$ is a good approximation to $g$ (in the appropriate sense)

• There exist probabilistic polynomial-time simulators $S_A$ and $S_B$ such that:

$\{S_A(a, g(x))\}_{x=(a,b) \in X} \stackrel{c}{\equiv} \{\mathrm{view}_A^\pi(x)\}_{x \in X}$
$\{S_B(b, g(x))\}_{x=(a,b) \in X} \stackrel{c}{\equiv} \{\mathrm{view}_B^\pi(x)\}_{x \in X}$

Roughly speaking, the equivalence is as follows. Suppose there are simulators in the standard
definition for an approximation $\hat g$. Then, putting $\tilde g = \hat g$, a simulator for the liberal definition can be
constructed by simulating $\tilde g(a, b) = \hat g(a, b)$ from $g(a, b)$ using the hypothesized simulator for functional
privacy, then simulating Alice's view from $\hat g(a, b)$ and $a$ using the hypothesized traditional simulator
for the protocol that computes $\hat g$. In the other direction, suppose there is a simulator in the liberal
definition for an approximation $\tilde g$. Let $\tau$ be a transcript of Alice's view except for input $a$. (As it
turns out, it is not necessary to include $a$ in $\tau$. If $a$ is much longer than $\tau$, as in our situation, we
want to avoid including $a$ in $\tau$ in order to keep $\tau$ short.) Define $\hat g = \tilde g.\tau$ to be $\tilde g$ with $\tau$ encoded into
its low-order bits. We assume that this kind of encoding into approximations can be accomplished
without significantly affecting the goodness of approximation; in fact, we will assume that the value
represented does not change at all, even if the "approximate" value is zero; that is, $\tau$ is auxiliary
data rather than an actual part of the value of $\hat g$. Note that a protocol for $\tilde g$ also serves as a
protocol for $\hat g$. It is trivial to simulate the messages of the protocol given $a$ and $\hat g$. Use the
hypothesized simulator in the liberal definition to show functional privacy.

We will use the technique of encoding into the low-order bits in our main result, which, formally,
will be proven in the standard definition. We remark that the NORM_ESTIMATION protocol from [14]
is presented in the liberal definition.

We will need the following standard definition.

Definition 2.12 (Additive Secret Sharing) An intermediate value $x$ of a joint computation is
said to be secret shared between Alice and Bob if Alice holds $r$ and Bob holds $x - r$, modulo some
large prime, where $r$ is a random number independent of all inputs and outputs.
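
Additive secret sharing as in Definition 2.12 can be sketched in a few lines of Python. The modulus $2^{61} - 1$ (a Mersenne prime) and the names `share`, `reconstruct`, and `add_shares` are illustrative assumptions; the text only requires "some large prime".

```python
import secrets

P = 2**61 - 1  # prime modulus (an illustrative choice)

def share(x: int):
    """Split x into two additive shares modulo P."""
    r = secrets.randbelow(P)       # Alice's share: uniform, independent of x
    return r, (x - r) % P          # Bob's share: x - r mod P

def reconstruct(alice_share: int, bob_share: int) -> int:
    """Recombine the two shares into the value modulo P."""
    return (alice_share + bob_share) % P

def add_shares(s: int, t: int) -> int:
    """Each party adds its own shares locally to obtain a share of the sum."""
    return (s + t) % P

xa, xb = share(20)
ya, yb = share(22)
total = reconstruct(add_shares(xa, ya), add_shares(xb, yb))
```

Each share on its own is uniformly distributed modulo the prime, so neither party learns anything about the shared value, while linear operations can be carried out share by share without interaction.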
The Private Sample Sum problem is as follows.

Definition 2.13 (Private Sample Sum) At the start, Alice holds a vector $a$ of length $N$ and
Bob holds a vector $b$. Alice and Bob also hold a secret sharing of an index $i$. At the end, Alice and
Bob hold a secret sharing of $a_i + b_i$.

That is, neither the index $i$ nor the value $a_i + b_i$ becomes known to the parties. Efficient protocols
for this can be found (or can be constructed immediately from related results) in |191 llUj . under 
various assumptions about the existence of Private Information Retrieval, such as in [0]. 

Proposition 2.14 There is a protocol PRIVATE-SAMPLE-SUM for the Private Sample Sum problem
that requires $\mathrm{poly}(N, k)$ computation, $\mathrm{poly}(\log(N), k)$ communication, and $O(1)$ rounds.
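
To make the input/output behavior of Private Sample Sum concrete, the sketch below models only the ideal functionality, computed by a hypothetical trusted party. It is not a secure protocol and does not reflect how the cited constructions work; the modulus and all function names are assumptions made for illustration.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus, assumed larger than N and 2M

def share(x):
    """Additive shares of x modulo P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def ideal_private_sample_sum(a, b, i_alice, i_bob):
    """Trusted-party stand-in: maps shares of i to fresh shares of a_i + b_i."""
    i = (i_alice + i_bob) % P      # only the trusted party sees the index
    value = (a[i] + b[i]) % P
    return share(value)            # fresh randomness hides the value

a = [3, -1, 0, 5]
b = [1, -4, 2, -5]
i_alice, i_bob = share(1)          # a secret sharing of the index i = 1
s_alice, s_bob = ideal_private_sample_sum(a, b, i_alice, i_bob)

recovered = (s_alice + s_bob) % P
if recovered > P // 2:             # map back to a signed value
    recovered -= P
```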

Our results also rely on the following protocol from [14] that privately approximates the
Euclidean norm of the vector sum.

Proposition 2.15 (Private $\ell_2$ approximation [14]) Suppose Alice and Bob have integer-valued
vectors $a$ and $b$ in $[-M, M]^N$ and let $c = a + b$. Fix distortion $\epsilon$ and security parameter $k$. There is a
protocol NORM_ESTIMATION that computes an approximation $\|c\|_{\sim}^2$ to $\|c\|_2^2$ such that

• $\frac{1}{1+\epsilon}\|a+b\|_2^2 \le \|a+b\|_{\sim}^2 \le \|a+b\|_2^2$.

• The protocol requires $\mathrm{poly}(k \log(M) N/\epsilon)$ local computation, $\mathrm{poly}(k \log(M) \log(N)/\epsilon)$
communication, and $O(1)$ rounds.

• The protocol is a private approximation protocol for $\|c\|_2^2$ in the sense of Definition 2.11.

Furthermore, the protocol's only access to $a$ and $b$ is through the matrix-vector products $Ra$ and $Rb$,
where $R$ is a pseudorandom matrix known to both players.

3 Private Euclidean Heavy Hitters 

We consider the setting in which Alice has signal $a$ of dimension $N$, and Bob has signal $b$ of the
same dimension. Let $c = a + b$. Both parties want to learn a representation $\hat c = \sum_{t \in T_{\mathrm{out}}} t$ such that
$\|c - \hat c\|_2^2 \le (1 + \epsilon)\|c - c_{\mathrm{opt}}\|_2^2$ and such that at most $c_{\mathrm{opt}}$ and $\|c\|_2$ are revealed. A protocol is given
in Figure 1.

3.1 Analysis 

First, to gain intuition, we consider some easy special cases of the protocol's operation. For our
analysis, assume that the terms in $c$ are already positive and in decreasing order, $c_0 \ge c_1 \ge \cdots \ge
c_{N-1} \ge 0$. We will be able to find the coefficient value of any desired term, so we focus on the set
of indices. Let $I_{\mathrm{opt}} = \{0, 1, 2, \ldots, B-1\}$ denote the set of indices for the optimal $B$ terms. Thus
$Q_{c,B,\theta} \subseteq Q_{c,B,\theta/(1+\epsilon)} \subseteq I_{\mathrm{opt}}$ and $Q_{c,B,\theta/(1+\epsilon)} \subseteq I$.

The ideal output is $I_{\mathrm{opt}}$, though any superset of $Q_{c,B,\theta}$ suffices to get an approximation with
error at most $(1 + \epsilon)$ times optimal. This includes the set $I \supseteq Q_{c,B,\theta}$ which the algorithm has
recovered. The set $I_B$ of the largest $B$ terms indexed by $I$ contains $Q_{c,B,\theta}$, so $I_B$ is a set of at most
PRIVATE_EUCLIDEAN_HEAVY_HITTERS

Known structural parameters: $N, M, B, \epsilon, k$, which determine $\theta = \frac{\epsilon}{B(1+\epsilon)}$ and $B'$.
Individual inputs: vectors $a$ and $b$, of length $N$, with integer values in the range $[-M, M]$.
Output: With probability at least $1 - 2^{-k}$, a set $T_{\mathrm{out}}$ of at most $B$ terms, such that

$\| c - \sum_{t \in T_{\mathrm{out}}} t \|_2^2 \le (1 + \epsilon) \| c - \sum_{t \in T_{\mathrm{opt}}} t \|_2^2$.

1. Exchange pseudorandom seeds (in the clear). Generate measurement matrices $R_1$ and $R_2$.
Alice locally constructs sketches $R_1 a$ and $R_2 a = (R_2^0 a, R_2^1 a, \ldots, R_2^{B-1} a)$, where the measurement
matrix $R_1$ is used for a non-private Euclidean Heavy Hitters and the measurement matrix
$R_2 = (R_2^0, R_2^1, \ldots, R_2^{B-1})$ is used for $B$ independent repetitions of NORM_ESTIMATION.
Bob similarly constructs $R_1 b$ and $R_2 b$.

2. Using general-purpose SMC, do

• Use an existing (non-private) Euclidean Heavy Hitters protocol to get, from $R_1 a$
and $R_1 b$, a secret-sharing of a superset $I$ of $Q_{B,\epsilon/(B(1+\epsilon)^2)}$, in which $I$ has exactly $B' \le
\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$ indices. (Pad with arbitrary indices, if necessary.)

3. Use PRIVATE-SAMPLE-SUM to compute, from $I$, $a$, and $b$, secret-shared values for each index
in $I$. Let $T$ denote the corresponding set of secret-shared terms. (Both the index and value
of each term in $T$ are secret shared.) Enumerate $I$ as $I = \{i_0, i_1, \ldots\}$ with $t_{i_0} > t_{i_1} > \cdots$.

4. Using SMC, do

• for $j = 0$ to $B - 1$

(a) From $R_2^j$, $R_2^j a$, $R_2^j b$, $t_{i_0}, t_{i_1}, \ldots, t_{i_{j-1}}$, sketch $r_j = c - (t_{i_0} + t_{i_1} + \cdots + t_{i_{j-1}})$ as
$R_2^j r_j = R_2^j a + R_2^j b - R_2^j (t_{i_0} + t_{i_1} + \cdots + t_{i_{j-1}})$.

(b) Use NORM_ESTIMATION to estimate $\|r_j\|_2^2$ as $\|r_j\|_{\sim}^2$, satisfying
$\frac{1}{1+\epsilon}\|r_j\|_2^2 \le \|r_j\|_{\sim}^2 \le \|r_j\|_2^2$.

(c) If $|c_{i_j}|^2 < \theta \|r_j\|_{\sim}^2$, break (out of for-loop)

(d) Output $t_{i_j}$

5. For technical reasons, encode the pseudorandom seeds for $R_1$ and $R_2$ into the low-order
bits of the output or (as we assume here) provide $R_1$ and $R_2$ as auxiliary output.

Figure 1: Protocol for the Euclidean Heavy Hitters problem
$B$ terms with error at most $(1 + \epsilon)$ times optimal. If $|Q_{c,B,\theta}| = B$, then $I_B = Q_{c,B,\theta} = I_{\mathrm{opt}}$, and $I_B$
is a private and correct output.

The difficulty arises when $|Q_{c,B,\theta}| < B$, in which case some of $I_B$ may be arbitrary and should
not be allowed to leak. So the algorithm needs to find a private subset $I_{\mathrm{out}}$ with $Q_{c,B,\theta} \subseteq I_{\mathrm{out}} \subseteq I_B$.
The challenge is subtle. Let $s$ denote $|Q_{c,B,\theta}|$. If the algorithm knew $s$, the algorithm could easily
output $Q_{c,B,\theta}$, which is the indices of the top $s$ terms, a correct and private output. Unfortunately,
determining $Q_{c,B,\theta}$ or $s = |Q_{c,B,\theta}|$ requires $\Omega(N)$ communication (see the section on lower bounds), so we cannot
hope to find $Q_{c,B,\theta}$ exactly. Non-private norm estimation can be used to find a subset $I_{\mathrm{out}}$ with
$Q_{c,B,\theta} \subseteq I_{\mathrm{out}} \subseteq Q_{c,B,\theta/(1+\epsilon)} \subseteq I_{\mathrm{opt}}$, which is correct, but not quite private. Given $|I_{\mathrm{out}}|$, the contents
of $I_{\mathrm{out}} \subseteq I_{\mathrm{opt}}$ are indeed private, but the size of $I_{\mathrm{out}}$ is, generally, non-private. Fortunately, if we
use a private protocol for norm estimation, $|I_{\mathrm{out}}|$ remains private. We now proceed to a formal
analysis.
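
The behavior of the loop in Step 4 of Figure 1 can be explored with a small reference simulation. The sketch below runs entirely in the clear, so it demonstrates only the correctness sandwich $Q_{B,\theta} \subseteq I_{\mathrm{out}} \subseteq Q_{B,\theta/(1+\epsilon)}$ and none of the privacy properties. The function names, the sample vector, and the noise model for the norm estimate (an arbitrary value within the guaranteed range) are assumptions made for illustration.

```python
from fractions import Fraction
import random

def rank(c, indices):
    """Indices sorted by decreasing term order (magnitude, ties by index)."""
    return sorted(indices, key=lambda j: (abs(c[j]), j), reverse=True)

def run_loop(c, I, B, theta, estimate):
    """Steps 4(a)-(d) in the clear: output indices until the test in (c) fails."""
    order = rank(c, I)
    residual = sum(x * x for x in c)      # ||r_0||^2 = ||c||^2
    out = []
    for i in order[:B]:
        if c[i] ** 2 < theta * estimate(residual):
            break                         # step (c): halt
        out.append(i)                     # step (d): output the term
        residual -= c[i] ** 2             # ||r_{j+1}||^2
    return out

eps = Fraction(1, 2)
B = 3
theta = eps / (B * (1 + eps))
rng = random.Random(0)

def noisy(norm_sq):
    """Any value in [norm_sq / (1 + eps), norm_sq], as the estimate guarantees."""
    low = norm_sq / (1 + eps)
    return low + Fraction(rng.randint(0, 100), 100) * (norm_sq - low)

c = [9, -7, 3, 3, 1, 0, 2, -1]
I = range(len(c))                         # here I is simply every index
Q_theta = run_loop(c, I, B, theta, lambda r: r)              # Q_{B, theta}
Q_loose = run_loop(c, I, B, theta / (1 + eps), lambda r: r)  # Q_{B, theta/(1+eps)}
I_out = run_loop(c, I, B, theta, noisy)
```

With an exact norm the loop returns exactly the qualified index set for the threshold used; with an estimate in the guaranteed range it may run slightly further, but never past the qualified set for the threshold $\theta/(1+\epsilon)$.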

Theorem 3.1 Protocol PRIVATE_EUCLIDEAN_HEAVY_HITTERS requires $\mathrm{poly}(N, \log(M), B, k, 1/\epsilon)$
local computation, $\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$ communication, and $O(1)$ rounds.

Proof: By existing work, all costs of Steps 1 to 3 are as claimed. Now consider Step 4. Observe
that the function being computed in Step 4 has inputs and outputs of size bounded by
$\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$ and takes time polynomial in the size of its inputs. In particular,
the instances of NORM_ESTIMATION do not start from scratch with a reference to $a$ or $b$; rather,
they pick up from the precomputed short sketches $R_2 a$ and $R_2 b$. It follows that this function can
be wrapped with SMC, preserving the computation and communication up to polynomial blowup
in the size of the input and keeping the round complexity to $O(1)$. ■

We now turn to correctness and privacy. Let /out denote the set of indices corresponding to the 
set Tout of output terms. 

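Theorem 3.2 Protocol PRIVATE_EUCLIDEAN_HEAVY_HITTERS is correct.

Proof: The correctness of Steps 1 and 2 follows from previous work. In Step 4, we first show that
$Q_{B,\epsilon/(B(1+\epsilon))} \subseteq I_{\mathrm{out}}$.

We assume that $\frac{1}{1+\epsilon}\|r_j\|_2^2 \le \|r_j\|_{\sim}^2 \le \|r_j\|_2^2$ always holds; by Proposition 2.15 this happens
with high probability. Thus, if $|c_{i_j}|^2 \ge \frac{\epsilon}{B(1+\epsilon)}\|r_j\|_2^2$, then
$|c_{i_j}|^2 \ge \frac{\epsilon}{B(1+\epsilon)}\|r_j\|_2^2 \ge \frac{\epsilon}{B(1+\epsilon)}\|r_j\|_{\sim}^2$.

By construction, $Q_{B,\epsilon/(B(1+\epsilon))} \subseteq I$. A straightforward induction shows that, if $i_j \in Q_{B,\epsilon/(B(1+\epsilon))}$, then
iteration $j$ outputs $t_{i_j}$ and the previous iterations output exactly the set of the $j$ larger terms in $I$.

By Proposition 2.5, since $I_{\mathrm{out}}$ is a superset of $Q_{B,\epsilon/(B(1+\epsilon))}$, if $\hat c = \sum_{i \in I_{\mathrm{out}}} c_i \delta_i$, then
$\|\hat c - c\|_2^2 \le (1 + \epsilon)\|c_{\mathrm{opt}} - c\|_2^2$, as desired.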



Before giving the complete privacy argument, we give a lemma, similar to the above. Suppose
a set $P$ of indices is a subset of another set $Q$ of indices. We will say that $P$ is a prefix of $Q$ if
$i \in P$, $t_j > t_i$, and $j \in Q$ imply $j \in P$.

Lemma 3.3 The output set $I_{\mathrm{out}}$ is a prefix of $Q_{B,\epsilon/(B(1+\epsilon)^2)}$ except with probability $2^{-\Omega(k)}$.
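Proof: Note that $Q_{B,\epsilon/(B(1+\epsilon)^2)}$ is a subset of $I$ and $Q_{B,\epsilon/(B(1+\epsilon)^2)}$ is a prefix of the universe, so
$Q_{B,\epsilon/(B(1+\epsilon)^2)}$ is a prefix of $I$. The set $I_{\mathrm{out}}$ is also a prefix of $I$. It follows that, of the sets $I_{\mathrm{out}}$ and
$Q_{B,\epsilon/(B(1+\epsilon)^2)}$, one is a prefix of the other (or they are equal).

So suppose, toward a contradiction, that $Q_{B,\epsilon/(B(1+\epsilon)^2)}$ is a proper prefix of $I_{\mathrm{out}}$. Let
$q = |Q_{B,\epsilon/(B(1+\epsilon)^2)}|$, so $q$ is the least number such that $i_q$ is not in $Q_{B,\epsilon/(B(1+\epsilon)^2)}$. If the protocol halts
before considering $q$, then $I_{\mathrm{out}} \subseteq Q_{B,\epsilon/(B(1+\epsilon)^2)}$, a contradiction. So, in particular, we may
assume that $q < B$ (so the for-loop doesn't terminate). Then, by definition of $Q_{B,\epsilon/(B(1+\epsilon)^2)}$, we have
$|c_{i_q}|^2 < \frac{\epsilon}{B(1+\epsilon)^2} \sum_{j \ge q} |c_{i_j}|^2$. It follows that

$$|c_{i_q}|^2 < \frac{\epsilon}{B(1+\epsilon)^2}\|r_q\|_2^2 \le \frac{\epsilon}{B(1+\epsilon)}\|r_q\|_{\sim}^2.$$

Thus the protocol halts without outputting $t_{i_q}$, after outputting exactly the elements in $Q_{B,\epsilon/(B(1+\epsilon)^2)}$.
■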

Finally, we turn to privacy. 

Theorem 3.4 Protocol PRIVATE_EUCLIDEAN_HEAVY_HITTERS leaks no more than $\|c\|_2$ and $c_{\mathrm{opt}}$.

Proof: With the random inputs $R_1$ and $R_2$ encoded into the output, it is straightforward to show
that Protocol PRIVATE_EUCLIDEAN_HEAVY_HITTERS is a private protocol in the traditional sense
that the protocol messages leak no more than the inputs and outputs. This is done by composing
simulators for PRIVATE-SAMPLE-SUM and SMC. It remains only to show that we can simulate
the joint distribution on $(\hat{c}, R_1, R_2)$ given as simulator-input $c_{\mathrm{opt}}$ and $\|c\|_2$. We will show that $R_1$
is indistinguishable from independent of the joint distribution of $(\hat{c}, R_2)$, which we will simulate
directly.

First, we show that $R_1$ is independent. Except with probability $2^{-\Omega(k)}$, the intermediate set $I$
is a superset of $Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$ and the norm estimation is correct. In that case, the protocol outputs a
prefix of $Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$ and we get identical output if $I$ is replaced by $Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$. Also,
$Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$ can be constructed from $c_{\mathrm{opt}}$ and $\|c\|_2$. Since the protocol proceeds without further
reference to $R_1$, we have shown that the pair $(\hat{c}, R_2)$ is indistinguishable from being independent of $R_1$. It remains
only to simulate $(\hat{c}, R_2)$.

Note that the output $\hat{c}$ does depend non-negligibly on $R_2$. If $|c_{i_j}|^2$ is very close to $\theta\|r_j\|_2^2$, then
the test $|c_{i_j}|^2 < \theta\|\tilde{r}_j\|_2^2$ in the protocol may succeed with probability non-negligibly far from 0 and
from 1, depending on $R_2$, since the distortion guarantee on $\|\tilde{r}_j\|_2^2$ is only the factor $(1 \pm \epsilon)$.

The simulator is as follows. Assume that the terms in $c_{\mathrm{opt}}$ are $t_0, t_1, \ldots, t_{B-1}$ in decreasing
order, $t_0 > t_1 > \cdots > t_{B-1}$. For each $j < B$, compute
$E_j = \|c - (t_0 + t_1 + \cdots + t_{j-1})\|_2^2 = \|c\|_2^2 - \|t_0 + t_1 + \cdots + t_{j-1}\|_2^2$
and then run the NORM_ESTIMATION simulator on input $E_j$ and $\epsilon$
to get a sample from the joint distribution $(\tilde{E}_j, R_2)$, where $\tilde{E}_j$ is a good estimate to $E_j$. Our
simulator then outputs $t_{i_j}$ if $|c_{i_j}|^2 \ge \frac{\epsilon}{B(1+\epsilon)}\tilde{E}_j$, and halts otherwise, following the final for-loop of
the protocol. Call the output of the simulator $\tilde{s} = \sum_j t_{i_j}\delta_{i_j}$.
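The simulator loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the callable `estimate` is a hypothetical stand-in for the NORM_ESTIMATION simulator (any function returning a value between $E/(1+\epsilon)$ and $E$), and the threshold $\epsilon/(B(1+\epsilon))$ is the assumed form of the test in the protocol's final for-loop.

```python
def simulate_output(opt_values, norm_sq, eps, estimate):
    """Sketch of the simulator's final loop.

    opt_values: values of the top-B terms, in decreasing magnitude.
    norm_sq:    the leaked quantity ||c||_2^2.
    estimate:   hypothetical stand-in for the NORM_ESTIMATION simulator;
                assumed to return a value in [E / (1 + eps), E].
    """
    B = len(opt_values)
    threshold = eps / (B * (1 + eps))  # assumed form of the protocol's test
    output = []
    removed = 0.0  # squared norm of the terms output so far
    for value in opt_values:
        residual_sq = norm_sq - removed        # E_j
        residual_est = estimate(residual_sq)   # estimate of E_j
        if value * value < threshold * residual_est:
            break                              # halt, as the protocol would
        output.append(value)
        removed += value * value
    return output


# Example: three candidate terms plus a tail of squared norm 1.
terms = [10.0, 5.0, 0.1]
total = sum(v * v for v in terms) + 1.0
print(simulate_output(terms, total, 0.5, lambda e: e))
```

The sketch uses only the top-B values and the norm, which is the point of the argument: nothing else about $c$ is needed to reproduce the output distribution.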

Again using the fact that a prefix of $Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$ is output, if $j \in Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$, then $i_j = j$; i.e., the
$j$'th largest output term is the $j$'th largest overall, so that, if $j$ is output, we have $E_j = \|r_j\|_2^2$. Thus
$(\tilde{E}_j, R_2)$ is distributed indistinguishably from $(\|\tilde{r}_j\|_2^2, R_2)$. The protocol finishes deterministically
using $I$ and $\|\tilde{r}_j\|_2^2$ and the simulator finishes deterministically using $Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$ and $\tilde{E}_j$, but, since
the protocol output is identical if $I$ is replaced by $Q_{B,\frac{\epsilon}{B(1+\epsilon)^2}}$, the distributions on output $(\hat{c}, R_2)$ of
the protocol and $(\tilde{s}, R_2)$ of the simulator are indistinguishable. ■

In summary. 

Theorem 3.5 Suppose Alice and Bob hold integer-valued vectors $a$ and $b$ in $[-M, M]^N$, respectively.
Let $B$, $k$ and $\epsilon$ be user-defined parameters. Let $c = a + b$. Let $T_{\mathrm{opt}}$ be the set of the largest
$B$ terms in $c$. There is a protocol, taking $a$, $b$, $B$, $k$ and $\epsilon$ as input, that, given $T_{\mathrm{opt}}$ and $\|c\|_2$, computes
a representation $\hat{c}$ of at most $B$ terms such that:

• $\|c - \hat{c}\|_2^2 \le (1+\epsilon)\|c_{\mathrm{opt}} - c\|_2^2$.

• The algorithm uses $\mathrm{poly}(N, \log(M), B, k, 1/\epsilon)$ time, $\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$ communication,
and $O(1)$ rounds.

• The protocol succeeds with probability $1 - 2^{-k}$ and leaks only $c_{\mathrm{opt}}$ and $\|c\|_2$ with security
parameter $k$.

Corollary 3.6 With the same hypotheses and resource bounds, there is a protocol that computes
$\hat{c}$ and an approximation $\widetilde{\|c - \hat{c}\|_2^2}$ to $\|c - \hat{c}\|_2^2$ such that
$\frac{1}{1+\epsilon}\|c - \hat{c}\|_2^2 \le \widetilde{\|c - \hat{c}\|_2^2} \le \|c - \hat{c}\|_2^2$, and the
protocol leaks only $c_{\mathrm{opt}}$ and $\|c - \hat{c}\|_2$.

Proof: Run the main protocol and output also $\widetilde{\|c - \hat{c}\|_2^2}$, which is computed in the course of the
main protocol. Note that $\|c - \hat{c}\|_2^2 = \|c\|_2^2 - \|\hat{c}\|_2^2$ and both $\|c\|_2$ and $\hat{c}$ are available to the main
simulator (as input and output, respectively), so we can modify the main simulator to compute
$\|c - \hat{c}\|_2$ as well. ■
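The identity used in this proof holds whenever the representation keeps a subset of the entries of $c$ exactly and zeroes the rest, so that the residual and the representation have disjoint supports. A small numerical check, using made-up integer data and an illustrative helper `sq_norm`:

```python
def sq_norm(v):
    """Squared Euclidean norm of a list of numbers."""
    return sum(x * x for x in v)


# Hypothetical vector sum c and a representation keeping indices 0 and 1.
c = [5, -3, 2, 1, 0, -1]
kept = {0, 1}
c_hat = [x if i in kept else 0 for i, x in enumerate(c)]
residual = [x - y for x, y in zip(c, c_hat)]

# Disjoint supports give ||c - c_hat||^2 = ||c||^2 - ||c_hat||^2.
print(sq_norm(residual), sq_norm(c) - sq_norm(c_hat))
```

Both printed values agree, which is why a simulator holding the norm of $c$ and the output representation can compute the residual norm on its own.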

3.2 Extension to Taxicab Heavy Hitters 

In this section, we show that our result of Euclidean approximation can be extended to approximate 
taxicab heavy hitters. 

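Lemma 3.7 Let $\hat{c}$ be the output of PRIVATE_EUCLIDEAN_HEAVY_HITTERS. If
$\|c - \hat{c}\|_2^2 \le (1+\epsilon)\|c - c_{\mathrm{opt}}\|_2^2$, then $\|c - \hat{c}\|_1 \le (1 + \sqrt{B\epsilon})\|c - c_{\mathrm{opt}}\|_1$.

Proof: Let $(i, c_i)$ be the largest term which is not in $Q_{B,\frac{\epsilon}{B(1+\epsilon)}}$. From Theorem 3.5 we know
$\big(\sum_{i \le j \le B} c_j^2\big)^{1/2} \le \sqrt{\epsilon}\,\big(\sum_{B < j \le N} c_j^2\big)^{1/2}$. Using the fact that
$\frac{1}{\sqrt{n}}\|x\|_1 \le \|x\|_2 \le \|x\|_1$ for any signal $x$ of length $n$, we get

$$\sum_{i \le j \le B} |c_j| \le \sqrt{B}\Big(\sum_{i \le j \le B} c_j^2\Big)^{1/2} \le \sqrt{B\epsilon}\Big(\sum_{B < j \le N} c_j^2\Big)^{1/2} \le \sqrt{B\epsilon}\sum_{B < j \le N} |c_j|.$$

Thus we have $\|c - \hat{c}\|_1 \le \sum_{i \le j \le N} |c_j| = \sum_{i \le j \le B} |c_j| + \sum_{B < j \le N} |c_j| \le (\sqrt{B\epsilon} + 1)\sum_{B < j \le N} |c_j| = (\sqrt{B\epsilon} + 1)\|c - c_{\mathrm{opt}}\|_1$. ■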
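The lemma can be sanity-checked numerically. The sketch below is an illustration under assumptions, not part of the protocol: it takes the representation to be the shortest prefix of the largest terms that meets the Euclidean guarantee, then tests the taxicab bound with the factor $1 + \sqrt{B\epsilon}$. The function name `check_taxicab_bound` is hypothetical.

```python
import math


def check_taxicab_bound(c, B, eps):
    """Check the taxicab bound for a prefix representation of c.

    Assumes the representation keeps the m largest terms (m <= B), where m
    is the smallest value meeting the Euclidean guarantee.
    """
    mags = sorted((abs(x) for x in c), reverse=True)
    tail = mags[B:]                       # terms outside the optimal top-B
    tail_l2_sq = sum(x * x for x in tail)
    tail_l1 = sum(tail)
    for m in range(B + 1):
        missing = mags[m:B]               # top-B terms that were not output
        err_l2_sq = sum(x * x for x in missing) + tail_l2_sq
        if err_l2_sq <= (1 + eps) * tail_l2_sq:
            err_l1 = sum(missing) + tail_l1
            bound = (1 + math.sqrt(B * eps)) * tail_l1
            return err_l1 <= bound + 1e-12
    return True  # not reached: m = B always meets the guarantee


print(check_taxicab_bound([10, 1, 1, 3, 3, 3, 3], B=3, eps=0.5))
```

In the example the shortest admissible prefix drops one of the top three terms, so the check exercises the case where the output is strictly worse than the optimum.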

Theorem 3.8 follows directly:

Theorem 3.8 Suppose Alice and Bob hold integer-valued vectors $a$ and $b$ in $[-M, M]^N$, respectively.
Let $B$, $k$ and $\epsilon$ be user-defined parameters. Let $c = a + b$. Let $T_{\mathrm{opt}}$ be the set of the
largest $B$ terms in $c$. There is a protocol, taking $a$, $b$, $M$, $N$, $B$, $k$ and $\epsilon$ as input, that computes a
representation $\hat{c}$ of at most $B$ terms such that:

• $\|c - \hat{c}\|_1 \le (1+\epsilon)\|c_{\mathrm{opt}} - c\|_1$.

• The algorithm uses $\mathrm{poly}(N, \log(M), B, k, 1/\epsilon)$ time, $\mathrm{poly}(\log(N), \log(M), B, k, 1/\epsilon)$ communication,
and $O(1)$ rounds.

• The protocol succeeds with probability $1 - 2^{-k}$ and leaks only $c_{\mathrm{opt}}$ and $\|c\|_2$ with security
parameter $k$.

3.3 Extension to other Orthonormal Bases 

In this section, we consider other orthonormal bases, such as the Fourier basis. Alice and Bob hold 
vectors a and b as before, and want the B largest Fourier terms — frequencies and corresponding 
coefficient values. The exact problem requires $\Omega(N)$ communication, so they settle for an approximation,
namely, they want a $B$-term Fourier representation $\hat{c}$ such that $\|c - \hat{c}\|_2^2 \le (1+\epsilon)\|c_{\mathrm{opt}} - c\|_2^2$,
where $c_{\mathrm{opt}}$ is the best possible $B$-term Fourier representation.

We note that a straightforward generalization of our main result solves this problem privately
and efficiently. Alice and Bob locally compute the inverse Fourier transform $F^{-1}a$ and $F^{-1}b$ of their
vectors $a$ and $b$. Because the Fourier transform is linear, $x = F^{-1}c = F^{-1}a + F^{-1}b$. Alice and Bob
now want to compute an approximation to the ordinary heavy hitters for the vector $x$. Suppose
the result is $\hat{x}$. Then $\hat{x}$ is the compact collection of Fourier terms and $\hat{c} = F\hat{x}$ is the corresponding
approximate representation of $c$. By the Parseval equality, since the Fourier basis is orthonormal, for
any $y$, we have $\|y\|_2 = \|Fy\|_2 = \|F^{-1}y\|_2$. It follows that $\|c - \hat{c}\|_2^2 \le (1+\epsilon)\|c_{\mathrm{opt}} - c\|_2^2$ if and only if
$\|x - \hat{x}\|_2^2 \le (1+\epsilon)\|x_{\mathrm{opt}} - x\|_2^2$, so the algorithm is correct when transformed to the Fourier domain.
It also follows that leaking $\|c\|_2$ is equivalent to leaking $\|Fc\|_2$, so the algorithm is private when
transformed to the Fourier domain. Alice and Bob require the additional overhead of computing a
Fourier transform locally, which fits within the overall budget.
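The two facts this argument relies on, linearity of the transform and preservation of the Euclidean norm, can be checked with a small unitary discrete Fourier transform. The sketch below is self-contained and uses made-up input vectors; the helper names `unitary_dft` and `norm2` are illustrative, and the direct $O(N^2)$ transform is chosen for clarity rather than speed.

```python
import cmath
import math


def unitary_dft(v, inverse=False):
    """Unitary DFT (scaled by 1/sqrt(n)), so that it preserves the 2-norm."""
    n = len(v)
    sign = 1 if inverse else -1
    return [
        sum(v[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        / math.sqrt(n)
        for k in range(n)
    ]


def norm2(v):
    """Euclidean norm of a list of real or complex numbers."""
    return math.sqrt(sum(abs(x) ** 2 for x in v))


# Hypothetical inputs held by Alice and Bob.
a = [3, 0, -1, 2, 0, 0, 4, 1]
b = [1, 1, 0, -2, 5, 0, 0, 0]
c = [x + y for x, y in zip(a, b)]

xa = unitary_dft(a, inverse=True)   # computed locally by Alice
xb = unitary_dft(b, inverse=True)   # computed locally by Bob
x = unitary_dft(c, inverse=True)    # transform of the (never revealed) sum

# Linearity: transforming the sum equals summing the local transforms.
linearity_gap = max(abs(x[k] - (xa[k] + xb[k])) for k in range(len(c)))
# Parseval: the norm is the same in both domains.
parseval_gap = abs(norm2(x) - norm2(c))
print(linearity_gap < 1e-9, parseval_gap < 1e-9)
```

Because each party transforms only its own vector, no extra communication is needed before running the heavy hitters protocol on the transformed inputs.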

4 Lower Bounds 

In this section, we show some lower bounds for problems related to our main problem, such as
computing an approximation to $c_{\mathrm{opt}}$ without leaking $\|c\|_2$. The results are straightforward, but we
include them to motivate the approximation and leakage of the protocols we present.

Theorem 4.1 There is an infinite family of settings of parameters $M, N, B, k, \epsilon$ such that any
protocol that computes the Euclidean norm exactly on the sum $c$ of individually-held inputs $a$ and
$b$ uses communication $\Omega(N)$. Similarly, any protocol that computes the exact Heavy Hitters or
computes the qualified set $Q_{c,1,1}$ exactly uses communication $\Omega(N)$.
Proof: Consider the set disjointness problem, which requires $\Omega(N)$ communication [16]. Alice
and Bob hold $\{0,1\}$-valued vectors $a$ and $b$ of length $N$ such that each of $a$ and $b$ has exactly $N/4$
1's and the supports are either disjoint or intersect in exactly one index. The task is to determine
the intersection size. Then, if $c = a + b$, we have $\|c\|_2^2 = N/2$ or $\|c\|_2^2 = N/2 + 2$, depending on
the size of the intersection, so a protocol for $\|c\|_2$ can be used to solve the set disjointness problem.
Similarly, finding the one largest heavy hitter solves the set disjointness problem.

Now consider vectors of length $N + 1$ in which indices $0$ to $N - 1$ directly code an instance of
set disjointness as above and index $N$ has a value that is always $N/2 + 2$. Then $|Q_{c,1,1}| = 1$ or
$|Q_{c,1,1}| = 0$ depending on the norm of indices $0$ to $N - 1$, which requires communication $\Omega(N)$ to
determine. ■
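The reduction from set disjointness can be made concrete by computing the squared norm of $a + b$ directly in the two cases. The instance below is a hypothetical example with $N = 16$; the helper name `squared_norm_of_sum` is illustrative.

```python
def squared_norm_of_sum(a, b):
    """Squared Euclidean norm of the entrywise sum a + b."""
    return sum((x + y) ** 2 for x, y in zip(a, b))


N = 16
# Alice's indicator vector: N/4 ones at indices 0..3.
a = [1] * 4 + [0] * 12
# Bob, disjoint case: N/4 ones at indices 4..7.
b_disjoint = [0] * 4 + [1] * 4 + [0] * 8
# Bob, overlapping case: N/4 ones at indices 3..6, sharing index 3 with Alice.
b_overlap = [0] * 3 + [1] * 4 + [0] * 9

print(squared_norm_of_sum(a, b_disjoint))  # N/2 entries equal to 1
print(squared_norm_of_sum(a, b_overlap))   # N/2 - 2 entries equal to 1, one equal to 2
```

The two cases differ only by a small additive constant in the squared norm, so any protocol that computes the norm exactly also decides whether the supports intersect.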

The above theorem motivates our study of approximate heavy hitters, for which there are 
protocols with exponentially better communication cost than the exact heavy hitters problem. The 
next theorem motivates leaking the Euclidean norm, by showing that any efficient protocol for the 
approximate heavy hitters problem leaks the Euclidean norm on all instances within a class. 

Theorem 4.2 There is an infinite family of settings of parameters $M, N, B, k, \epsilon$ such that any
protocol that solves the Euclidean Heavy Hitters problem on the sum $c$ of individually-held inputs $a$
and $b$, leaking only $c_{\mathrm{opt}}$, uses communication $\Omega(N)$. Furthermore, for an infinite class of inputs in
which $\|c\|_2$ is not constant, any such protocol either computes $\|c\|_2$ or uses communication $\Omega(N)$.

Proof: Consider vectors $c$ of one of two cases, given by random permutations of the following
vectors:

$$(2N, \overbrace{1, 1, \ldots, 1}^{N/2-1}, 0, 0, \ldots, 0) \quad \text{(case 1)},$$

$$(2N, \overbrace{N, N, \ldots, N}^{N/2-1}, 0, 0, \ldots, 0) \quad \text{(case 2)}.$$

Fix $B = 1$ and $\epsilon \approx 1/N$. A correct protocol finds the top term in case 1. In case 2, it turns out
that the correctness requirement is vacuous, but, fortunately, the privacy requirement is useful. A
protocol leaking only $c_{\mathrm{opt}}$ must behave indistinguishably in cases 1 and 2 since $c_{\mathrm{opt}}$ is the same, so
a private protocol reliably finds the top coefficient in case 2. Since a protocol for case 2 can be
used to solve the set disjointness problem, such a protocol uses $\Omega(N)$ bits of communication. In
particular, any protocol either behaves differently on the two cases, thereby computing $\|c\|_2$ for
inputs in the union of the two cases, or uses communication $\Omega(N)$. ■

Note that the above theorem also shows that it is impossible in some cases to solve the approx- 
imate taxicab heavy hitters problem efficiently without leaking the Euclidean norm. 

Although the class of inputs above is contrived, the (implied) parameter settings are natural, 
i.e., $\log(M)$, $\log(N)$, $B$, $k$, $1/\epsilon$ can be made to be polynomially related, etc.